Abstract 

A statistical test for the detection of answer copying on multiple-choice tests is 
presented. The test is based on the idea that the answers of examinees to test items may be 
the result of three possible processes: (1) knowing, (2) guessing, and (3) copying, but that 
examinees who do not have access to the answers of other examinees can arrive at their 
answers only through the first two processes. This assumption leads to a distribution for 
the number of matched incorrect alternatives between the examinee suspected of copying 
and the examinee believed to be the source that belongs to a family of "shifted binomials". 
Power functions for the test for several sets of parameter values are analyzed. It is shown 
that an extension of the test to include matched numbers of correct alternatives would lead 
to improper statistical hypotheses. 

Key words: Answer Copying; Cheating; Hypothesis Testing; Multiple-Choice 
Testing; Shifted Binomial. 
A Statistical Test for Detecting Answer Copying 
on Multiple-Choice Tests 

Among the first to derive a statistical test to detect answer copying on multiple-choice 
tests were Frary, Tideman, and Watts (1977). Their $g_2$ index is an attempt to evaluate 
the number of matching alternatives between an examinee suspected to be a copier and 
another examinee believed to be the source against the expected number of matching 
alternatives. (For convenience, we will refer to these examinees simply as "copier" and 
"source".) The problem inherent in a test of this nature is how to obtain the distribution 
of the index under the null hypothesis of no copying. Frary et al. attempted to solve 
this problem by establishing a null model that assumes that the probability of selecting 
an alternative on an item is a certain function of the copier’s number-correct score, the 
average number-correct score in the population, and the proportion of examinees in the 
population that selected the same alternative. Note that the first two quantities correct the 
probability of selecting an alternative for the examinee’s relative ability in the population. 

The K-index (Holland, 1996; Lewis & Thayer, 1998) is the result of an attempt to 
correct more explicitly for the examinee’s ability. The index focuses only on the number 
of matching alternatives on the items that were answered incorrectly by the source. 
The null model is a binomial with a success parameter that is obtained by piecewise 
linear regression of the proportion of matching incorrect alternatives on the proportion of 
incorrect alternatives in each number-incorrect score group in a population of examinees. 
An alternative with quadratic regression is given in Sotaridona and Meijer (2002). 

The most elaborate null model for a test to detect copying is the one on which 
Wollack’s $\omega$ index is based (Wollack, 1997; Wollack & Cohen, 1998). The probability 
of selecting an alternative on an item for an examinee who does not copy is assumed 
to follow the nominal response model (Bock, 1997). The $\omega$ index has the same form 
as the $g_2$ index but compares the observed number of matching alternatives against the 
(estimated) expected number under the nominal response model for all items in the test. 
Note that the use of the nominal response model automatically involves conditioning of 
the probabilities of choosing an alternative on the examinee’s ability. 









In spite of the attempts to condition on the examinee’s ability, a fundamental feature 
of all three tests is their dependency on the distribution of the item scores in the population 
of examinees. If the population changes, the results from these tests also change: the $g_2$ 
index has to be calculated from a different proportion of times an alternative is chosen and 
a different average number-correct score in the population, the K-index has to be based 
on a different regression equation, and the nominal response model has to be refitted with 
possibly different parameter estimates or a less satisfactory fit. The fact that these three 
statistics are population dependent thus implies that the same pair of examinees may be 
flagged for answer copying in one population but not in another. 

The purpose of this paper is to present a statistical test to detect answer copying on 
multiple-choice tests that can be used when any reference to a population of examinees 
is undesirable. Obviously, we need a set of assumptions to derive a statistical test, but in 
the current case the assumptions are only about the response behavior of the individual 
examinee suspected of copying. In essence, the assumptions are based on the idea that 
an examinee who has access to the answers of a source can arrive at his/her own answer 
through three different processes: (1) knowing, (2) guessing or (3) copying. Examinees 
who do not have access to a source can only produce answers through the first two 
processes. No other assumptions are made. In particular, nothing is assumed about or 
inferred from the distribution of item scores in a population of examinees. Also, no 
assumption is made about the behavior of the examinee who may have served as a source 
to the copier. 



Derivation of the Test 

Like the K-index, the test focuses on the items for which the source has an incorrect 
answer. We will motivate this choice later by showing that an extension of the test 
to include items with correct answers by the source will lead to improper statistical 
hypotheses. 






Assumptions 

The test is derived from the following assumptions about the behavior of the copier on 
the items the source has answered incorrectly. First, if an examinee knows 
an item, he/she gives a correct answer. This assumption implies that if an examinee has 
access to the source but discovers that the source’s answer is incorrect, he/she does not copy but 
gives his/her own answer. Second, if an examinee does not know an item but has access to 
the source, he/she accepts the answer by the source and copies. Third, if an examinee does 
not know an item and does not have access to a source, he/she guesses blindly among the 
response alternatives. Thus, for each item answered incorrectly by the source, we have three possible 
true states in which the copier can be, each characterized by a different probability of 
choosing the same alternative the source has chosen. 

We use the following notation to present these probabilities. Let $i = 1, \ldots, I$ denote 
the items in the test and $a = 1, \ldots, k$ the response alternatives for these items. In addition, 
indices $s$ and $j$ are used for the source and the copier, respectively. The alternatives chosen 
by these two examinees on item $i$ are denoted by random variables $U_{si}$ and $U_{ji}$. The set 
of items for which $s$ has chosen an incorrect alternative is denoted as $W_s$. The size of this 
set is denoted as $w_s$. Finally, a (random) indicator variable $I_{sji}$ is used to identify the items 
for which examinees $s$ and $j$ have chosen the same alternative. That is, 

$$
I_{sji} =
\begin{cases}
1 & \text{if } U_{si} = U_{ji}, \\
0 & \text{if } U_{si} \neq U_{ji}.
\end{cases}
$$


The three possible probabilities for examinee $j$ to choose the same alternative on the items 
in $W_s$ as $s$ are the following: 

$$
\Pr\{I_{sji} = 1\} =
\begin{cases}
0 & \text{if } j \text{ knows the answer on } i \in W_s, \\
k^{-1} & \text{if } j \text{ guesses blindly on } i \in W_s, \\
1 & \text{if } j \text{ copies from } s \text{ on } i \in W_s.
\end{cases}
\tag{1}
$$
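
As a minimal sketch of (1), the three true states can be encoded as a small lookup. The function name `match_probability` and the state labels `"knows"`, `"guesses"`, and `"copies"` are illustrative assumptions, not notation from the paper.

```python
from fractions import Fraction


def match_probability(state: str, k: int) -> Fraction:
    """Probability that copier j chooses the source's (incorrect) alternative
    on an item in W_s, given j's true state and k alternatives per item."""
    if k < 2:
        raise ValueError("an item needs at least two alternatives")
    if state == "knows":
        # j answers correctly, so j cannot match an incorrect answer.
        return Fraction(0)
    if state == "guesses":
        # Blind guess among the k alternatives.
        return Fraction(1, k)
    if state == "copies":
        # j takes over the answer of the source.
        return Fraction(1)
    raise ValueError(f"unknown state: {state!r}")
```

For a four-choice item, `match_probability("guesses", 4)` evaluates to `Fraction(1, 4)`.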
Hypotheses 

The hypothesis to be tested is that $j$ did not copy any of the items in $W_s$. We propose 
testing this hypothesis against the alternative that $j$ copied the answers for some of the 
items in $W_s$ that he/she did not know. Observe that this alternative is less extreme than 
the hypothesis of $j$ copying all items in $W_s$. Under the current alternative hypothesis, it 
is still possible that $j$ actually knows some of the items in $W_s$ and for this reason did not 
copy them, or that he/she did not have access to the answers by $s$ for all of the items in 
$W_s$. 

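Let $\kappa_j$ be the number of items in the set $W_s$ examinee $j$ knows and $\gamma_j$ the number in 
this set examinee $j$ copied from $s$. More formally, at the level of the set of items $W_s$, the 
hypothesis to be tested is: 

$$
H_0: \Pr\{I_{sji} = 1\} =
\begin{cases}
0 & \text{for } \kappa_j \text{ items in } W_s, \\
k^{-1} & \text{for } w_s - \kappa_j \text{ items in } W_s,
\end{cases}
\tag{2}
$$

against 

$$
H_1: \Pr\{I_{sji} = 1\} =
\begin{cases}
0 & \text{for } \kappa_j \text{ items in } W_s, \\
k^{-1} & \text{for } w_s - \kappa_j - \gamma_j \text{ items in } W_s, \\
1 & \text{for } \gamma_j \text{ items in } W_s,
\end{cases}
\tag{3}
$$

with $\kappa_j \geq 0$, $\gamma_j > 0$, and $\kappa_j + \gamma_j \leq w_s$. 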



Distribution of Matching Incorrect Alternatives 

The proposed test statistic is the number of matching incorrect alternatives between 
$j$ and $s$ on the items in set $W_s$: 

$$
Z_{js} = \sum_{i \in W_s} I_{sji}. \tag{4}
$$
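
The statistic in (4) could be computed from observed responses along the following lines. This is only a sketch: the function name, the argument names, and the coding of alternatives as integers are assumptions made for illustration, and an answer key is assumed to be available to identify the items in the set of items the source answered incorrectly.

```python
from typing import Sequence, Tuple


def matching_incorrect(
    source: Sequence[int], copier: Sequence[int], key: Sequence[int]
) -> Tuple[int, int]:
    """Return (w_s, z_js): the number of items the source answered
    incorrectly, and the number of those items on which the copier
    chose the same alternative as the source."""
    if not (len(source) == len(copier) == len(key)):
        raise ValueError("all response vectors must have the same length")
    # Items for which the source chose an incorrect alternative (the set W_s).
    wrong = [i for i, (u_s, correct) in enumerate(zip(source, key)) if u_s != correct]
    # Count matches between copier and source on those items only.
    z_js = sum(1 for i in wrong if copier[i] == source[i])
    return len(wrong), z_js


# Hypothetical five-item test: the source is wrong on three items,
# and the copier matches the source on two of them.
w_s, z_js = matching_incorrect(
    source=[1, 3, 3, 1, 2], copier=[1, 3, 2, 1, 1], key=[1, 2, 3, 4, 1]
)
```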

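Both hypotheses imply distributions of $Z_{js}$ belonging to a family with probability function 

$$
p(z; w_s, \gamma_j, \kappa_j, k) =
\begin{cases}
0 & \text{for } z < \gamma_j, \\
\binom{w_s - \kappa_j - \gamma_j}{z - \gamma_j} \left(k^{-1}\right)^{z - \gamma_j} \left((k-1)k^{-1}\right)^{w_s - \kappa_j - z} & \text{for } \gamma_j \leq z \leq w_s - \kappa_j, \\
0 & \text{for } w_s - \kappa_j < z,
\end{cases}
\tag{5}
$$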
with $\kappa_j \geq 0$, $\gamma_j \geq 0$, and $\kappa_j + \gamma_j \leq w_s$. The definition of this family follows from 
the fact that if $j$ copies $\gamma_j$ answers from $W_s$, the probabilities of observing numbers of 
matches smaller than $\gamma_j$ are each equal to zero. Likewise, if $j$ knows $\kappa_j$ items in $W_s$, the 
probabilities of observing numbers of matches larger than $w_s - \kappa_j$ are each equal to zero. 
However, for the subset of $w_s - \kappa_j - \gamma_j$ items that $j$ does not know and for which he/she 
has not copied any answer, the number of matches follows a binomial distribution with 
success parameter $k^{-1}$. Observe that the probability of $Z_{js} = \gamma_j$ belongs to the compound 
event of $j$ copying $\gamma_j$ items and guessing none of the alternatives the source has chosen. 
Likewise, the probability of $Z_{js} = w_s - \kappa_j$ belongs to the compound event of $j$ copying 
$\gamma_j$ items, knowing $\kappa_j$ items, and guessing the alternatives the source has chosen on all 
remaining items. 

The function in (5) can be presented more compactly as 

$$
p(z; w_s, \gamma_j, \kappa_j, k) = \binom{w_s - \kappa_j - \gamma_j}{z - \gamma_j} \left(k^{-1}\right)^{z - \gamma_j} \left((k-1)k^{-1}\right)^{w_s - \kappa_j - z} I_{\{\gamma_j, \ldots, w_s - \kappa_j\}}(z), \tag{6}
$$

where $I_{\{\gamma_j, \ldots, w_s - \kappa_j\}}(z)$ is an indicator function that is equal to 1 if $z \in \{\gamma_j, \ldots, w_s - \kappa_j\}$ 
and equal to 0 otherwise. Because $p(z; w_s, \gamma_j, \kappa_j, k)$ is nonzero only for $z \in \{\gamma_j, \ldots, w_s - \kappa_j\}$, 
this function indicates the support of the family of distributions in (6). In spite of the 
presence of the binomial expression in the definition of (6), the family is not the binomial 
over the range of possible values of $Z_{js}$. We will refer to this family as the "shifted 
binomial", because it can be viewed as a binomial with its support shifted from $\{0, \ldots, w_s\}$ 
to $\{\gamma_j, \ldots, w_s - \kappa_j\}$. The size of the shift is a critical quantity because it depends both on the 
(unknown) number of items $j$ knows and on the number $j$ has copied. 
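
A minimal sketch of the shifted binomial probability function, written from the verbal description above: a binomial over the items that are neither known nor copied, shifted upward by the number of copied items. The function name `shifted_binomial_pmf` and the argument names `gamma` and `kappa` (numbers of copied and known items) are assumptions made for illustration.

```python
from math import comb


def shifted_binomial_pmf(z: int, w_s: int, gamma: int, kappa: int, k: int) -> float:
    """P(Z_js = z) when j copied `gamma` items, knows `kappa` items, and
    guesses blindly among k alternatives on the remaining items of W_s."""
    if k < 2 or gamma < 0 or kappa < 0 or gamma + kappa > w_s:
        raise ValueError("need k >= 2, gamma >= 0, kappa >= 0, gamma + kappa <= w_s")
    # Outside the support {gamma, ..., w_s - kappa} the probability is zero.
    if z < gamma or z > w_s - kappa:
        return 0.0
    n = w_s - kappa - gamma  # items answered by blind guessing
    x = z - gamma            # matches obtained by guessing
    p = 1.0 / k              # probability of guessing the source's alternative
    return comb(n, x) * p**x * (1.0 - p) ** (n - x)
```

Summing the function over `z = 0, ..., w_s` should give 1 for any admissible parameter values, which is a quick sanity check on the support.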

Statistical Test 

Under the distribution in (6), the two hypotheses in (2)-(3) simplify to 

$$
H_0: \gamma_j = 0 \tag{7}
$$

and 

$$
H_1: \gamma_j > 0. \tag{8}
$$

However, the null distribution under which the (right-sided) test of the hypothesis in 
(7) has to be conducted still depends on the unknown parameter $\kappa_j$. We propose to conduct 
the test under the auxiliary assumption that $j$ did not know any of the items in $W_s$, that is, 
$\kappa_j = 0$. This assumption gives us a test that tends to be more conservative than the one 
actually needed: From (6) it follows that the upper tail of the distribution for $\kappa_j = 0$ is 
further to the right than the upper tails of the distributions for $\kappa_j > 0$. As a result, setting 
$\kappa_j = 0$ results in a critical value for the test larger than the one needed for the (unknown) 
true value of $\kappa_j$ at the nominal level of significance. 

We feel the auxiliary assumption is permitted because it does not harm the copier 
in any way. The one who may have to pay a price for this assumption is the testing agency, 
because of a loss of power of the test to detect answer copying. We will quantify the 
extent to which the critical value of the test is larger than actually needed as well as the 
differences in power resulting from this increase later in this paper. 

A (nonrandomized) test of the hypothesis of $j$ not having copied any answer against 
the alternative of $j$ having copied the answers of some of the items in $W_s$ with nominal 
significance level not larger than $\alpha$ has as critical value for the test statistic $Z_{js}$ in (4) the 
smallest value $z^*$ for which the distribution in (6) yields 

$$
\Pr\{Z_{js} \geq z^*\} \leq \alpha. \tag{9}
$$
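
The search for the critical value can be sketched as follows, reading (9) as the smallest $z^*$ whose upper-tail probability, counting $z^*$ itself, does not exceed $\alpha$ under no copying and no known items. The function names are assumptions, as is the convention of returning `w_s + 1` (never reject) when no attainable value satisfies the condition.

```python
from math import comb


def upper_tail(z_star: int, n: int, p: float) -> float:
    """P(X >= z_star) for X ~ Binomial(n, p)."""
    return sum(
        (comb(n, x) * p**x * (1.0 - p) ** (n - x) for x in range(max(z_star, 0), n + 1)),
        0.0,
    )


def critical_value(w_s: int, k: int, alpha: float = 0.05) -> int:
    """Smallest z* with P(Z_js >= z*) <= alpha under gamma_j = 0 and the
    auxiliary assumption kappa_j = 0, i.e. Z_js ~ Binomial(w_s, 1/k)."""
    if w_s < 0 or k < 2 or not 0.0 <= alpha <= 1.0:
        raise ValueError("need w_s >= 0, k >= 2 and 0 <= alpha <= 1")
    p = 1.0 / k
    for z_star in range(w_s + 2):
        if upper_tail(z_star, w_s, p) <= alpha:
            return z_star
    return w_s + 1  # not reached: the tail beyond w_s is empty, so it is 0
```

The hypothesis of no copying would then be rejected when the observed number of matching incorrect alternatives is at least `critical_value(w_s, k, alpha)`.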



Uniformly Most Powerful Test 

For a statistical test it is desirable to be uniformly most powerful (UMP) at the level 
of significance chosen. From the Karlin-Rubin theorem (e.g., Casella & Berger, 1990, 
sect. 8.3.2) it follows that the above test is a UMP test with level associated with the 
critical value in (9) provided the family in (6) has a monotone likelihood ratio in $Z_{js}$ and 
$Z_{js}$ is a sufficient statistic for the number of items copied, $\gamma_j$. It is easy to show that (6) has 
both properties for the case $\kappa_j = 0$. 

As for the property of a monotone likelihood ratio, for the test of (7) against (8) it is 
sufficient to show that the ratio of (6) for $\gamma_j = \gamma > 0$ and $\gamma_j = 0$ is nondecreasing in $z$. 
Simplifying, omitting constants, and cancelling factors, the ratio can be shown to be equal 
to 

$$
\frac{z!}{(z - \gamma)!},
$$

which is increasing in $z$. 
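
For readers who want the intermediate step, the ratio can be written out as follows. This is a sketch that assumes the binomial form of the family described above, with the number of known items set to zero and $\gamma \leq z \leq w_s$:

$$
\frac{\binom{w_s - \gamma}{z - \gamma} \left(k^{-1}\right)^{z - \gamma} \left((k-1)k^{-1}\right)^{w_s - z}}{\binom{w_s}{z} \left(k^{-1}\right)^{z} \left((k-1)k^{-1}\right)^{w_s - z}}
= k^{\gamma} \, \frac{(w_s - \gamma)!}{w_s!} \cdot \frac{z!}{(z - \gamma)!}.
$$

The first two factors on the right do not depend on $z$, and $z!/(z - \gamma)! = z(z-1)\cdots(z - \gamma + 1)$ grows with $z$. For $z < \gamma$ the numerator is zero, so the ratio is nondecreasing over the whole range of $z$.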

The fact that $Z_{js}$ is a sufficient statistic for $\gamma_j$ follows from the well-known 
factorization criterion. The factor 

$$
\left((k-1)k^{-1}\right)^{w_s - \kappa_j - z}
$$

in (6) is independent of $\gamma_j$ whereas its remaining part is dependent on $\gamma_j$ and $z$. 

It is instructive to compare this result with those for a test of a point hypothesis for 
the success parameter in a regular binomial family, which also is UMP. In the current case, 
(6) is not the regular binomial and the parameter of interest is not a success parameter but 
a parameter that defines both the support of the distribution and the number of Bernoulli 
trials on which it is based. Observe also that the test is not UMP with nominal level $\alpha$ but with 
the actual level of significance associated with (9). An exact level $\alpha$ test is only possible 
for a randomized version of (9). 

Finally, it is emphasized that the above result holds for the test in (9), which is based 
on the assumption $\kappa_j = 0$, but that it has not been shown that the test of (7) is UMP for 
an unknown value of $\kappa_j$. The impact of this parameter on the power of the test will be 
evaluated empirically in the next section. 
Power of the Test 

The actual power of the above test is a function of the unknown number of items 
$j$ has copied from $s$, $\gamma_j$. The shape of the power function depends on (1) the number 
of alternatives per item, $k$, (2) the number of items $s$ has incorrect, $w_s$, (3) the 
significance level chosen for the test, $\alpha$, and (4) the number of items the examinee knows, 
$\kappa_j$. 

We first present a set of power functions for the case $\kappa_j = 0$ for $k = 2, \ldots, 5$, 
$w_s = 20, 30, 40, 50$, and significance level $\alpha = .05$. The functions were calculated by 
first finding the critical value $z^*$ for $\alpha = .05$ in (9) under the distribution given by (6) with 
$\gamma_j = 0$ and then calculating the probabilities $\Pr\{Z_{js} \geq z^*\}$ under the same distribution 
for $\gamma_j = 0, \ldots, w_s$. The power functions are presented in Figure 1. From these functions 
it is clear that the test has considerable power to detect copying on multiple-choice tests, 
particularly if the number of response alternatives per item, $k$, goes up. But even for a 
test with three-choice items the power to detect copying is already perfect if the examinee 
has copied approximately half of the items in $W_s$ for $w_s = 20$ or one third for $w_s = 50$. 

[Insert Figure 1 about here] 
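
The two-step calculation described above (critical value first, then tail probabilities for each number of copied items) can be sketched as follows for the case in which the examinee knows none of the items. Function and argument names are assumptions; the code is not the authors' program and is not claimed to reproduce Figure 1 exactly.

```python
from math import comb


def upper_tail(z_star: int, n: int, p: float) -> float:
    """P(X >= z_star) for X ~ Binomial(n, p)."""
    return sum(
        (comb(n, x) * p**x * (1.0 - p) ** (n - x) for x in range(max(z_star, 0), n + 1)),
        0.0,
    )


def critical_value(w_s: int, k: int, alpha: float = 0.05) -> int:
    """Smallest z* with P(Z >= z*) <= alpha when nothing is copied or known."""
    return next(z for z in range(w_s + 2) if upper_tail(z, w_s, 1.0 / k) <= alpha)


def power(gamma: int, w_s: int, k: int, alpha: float = 0.05) -> float:
    """P(Z_js >= z*) when `gamma` of the w_s items were copied and the
    remaining w_s - gamma items were answered by blind guessing."""
    if not 0 <= gamma <= w_s:
        raise ValueError("need 0 <= gamma <= w_s")
    z_star = critical_value(w_s, k, alpha)
    # Z_js = gamma + X with X ~ Binomial(w_s - gamma, 1/k).
    return upper_tail(z_star - gamma, w_s - gamma, 1.0 / k)


# One power curve: k = 4 alternatives, w_s = 20 items incorrect by the source.
curve = [power(gamma, 20, 4) for gamma in range(21)]
```

At `gamma = 0` the function returns the actual significance level of the nonrandomized test, which is at most the nominal level.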

It was noted earlier that the auxiliary assumption of $\kappa_j = 0$ leads to a test that tends 
to be conservative. Figure 2 illustrates the effect of the assumption of $\kappa_j = 0$ on the 
critical value of the test for the same sets of parameter values as in Figure 1 ($k = 2, \ldots, 5$; 
$w_s = 20, 30, 40, 50$; $\alpha = .05$). The curves show the critical values as a function of the 
number of items $j$ knows, $\kappa_j$. For example, the lower-left plot shows that for a test with 
four-choice items ($k = 4$), the assumption $\kappa_j = 0$ leads to a critical value for the test 
equal to $z^* = 17$. However, if the examinee actually knows $\kappa_j = 10$ of the 40 items in 
the set $W_s$, the critical value could have been lowered to $z^* = 14$ to realize the nominal 
significance level of $\alpha = .05$. Except for small horizontal pieces, which are due to the 
discreteness of test statistic $Z_{js}$, all curves decrease with $\kappa_j$. This feature reflects the fact 
that the proposed statistical test is generally conservative, unless the assumption $\kappa_j = 0$ 
happens to be true. 
[Insert Figure 2 about here] 
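
How much the critical value under the auxiliary assumption exceeds the one needed for a given number of known items can be tabulated with a short sketch. The helper names and the convention for the critical value (smallest value whose upper-tail probability, counting the value itself, does not exceed the significance level) are assumptions; no attempt is made to reproduce the numbers read from Figure 2.

```python
from math import comb


def upper_tail(z_star: int, n: int, p: float) -> float:
    """P(X >= z_star) for X ~ Binomial(n, p)."""
    return sum(
        (comb(n, x) * p**x * (1.0 - p) ** (n - x) for x in range(max(z_star, 0), n + 1)),
        0.0,
    )


def critical_value(w_s: int, k: int, kappa: int = 0, alpha: float = 0.05) -> int:
    """Critical value when no item is copied and `kappa` items are known,
    so that the null distribution is Binomial(w_s - kappa, 1/k)."""
    if k < 2 or not 0 <= kappa <= w_s:
        raise ValueError("need k >= 2 and 0 <= kappa <= w_s")
    n = w_s - kappa
    return next(z for z in range(n + 2) if upper_tail(z, n, 1.0 / k) <= alpha)


def excess_critical_value(w_s: int, k: int, kappa: int, alpha: float = 0.05) -> int:
    """Difference between the critical value used (kappa = 0) and the one
    that would suffice if the true number of known items were `kappa`."""
    return critical_value(w_s, k, 0, alpha) - critical_value(w_s, k, kappa, alpha)


# Excess for a hypothetical case: 40 items incorrect by the source, k = 4.
table = {kappa: excess_critical_value(40, 4, kappa) for kappa in (0, 5, 10, 20)}
```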

It was also noted that the price for using a conservative test is not paid by the 
examinee but by the testing agency in the form of less than optimal power. Figures 
3-6 show how much larger the power of the test would have been if we had known 
the true value of $\kappa_j$. The curves in these figures show the increase in power relative to 
the power functions in Figure 1. That is, the increase in power was calculated as the 
difference between the power of the test for the true value of $\kappa_j$ and $\kappa_j = 0$ divided 
by the power for $\kappa_j = 0$, and the result was plotted as a function of $\gamma_j$. Figures 3-6 
show these functions for the same sets of parameter values as in the previous two figures 
($k = 2, \ldots, 5$; $w_s = 20, 30, 40, 50$; $\alpha = .05$) and a selected set of values of $\kappa_j$. Each panel in 
these figures shows approximately the same pattern, which can be summarized as follows. 
First, knowing the true value of $\kappa_j$ would only lead to an increase of power for small values 
of $\gamma_j$. Second, the increase would be larger, the smaller the number of alternatives per 
item, $k$. These two findings are consistent with the results in Figure 1, which show that 
the power curves are nearly equal to 1.0 for larger values of $\gamma_j$ but approach this state at 
a somewhat lower rate for items with fewer alternatives. Once the power is close to one, 
there is hardly any room for improvement left. Third, the increase in power is generally 
larger for large values of $\kappa_j$, with an exception at the smallest values of $\gamma_j$, where for 
some of the larger values of $\kappa_j$ the assumption $\kappa_j = 0$ actually appears to result in a small 
increase in power. These exceptions are due to the discrete nature of the null distribution of 
the test and the definition of the critical value in (9). For larger values of $\kappa_j$ the actual level 
of Type I error can become smaller than $\alpha$, and hence, for these values of $\kappa_j$, the power 
of the test can become smaller than for $\kappa_j = 0$. A randomized version of the test would 
not suffer from this problem, but the use of randomization in a statistical test to detect 
cheating on multiple-choice tests does not seem feasible. Fourth, for smaller values of $\kappa_j$ 
the power of the test appears to be remarkably robust and the assumption of $\kappa_j = 0$ does 
not involve much loss of power over the entire range of values of $\gamma_j$. 
[Insert Figures 3-6 about here] 
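
The relative increase in power can be sketched under one possible reading of the calculation described above: the power of the test with the critical value set at the true number of known items, evaluated when that many items are in fact known, is compared with the baseline power curve obtained when no items are known. That reading, and all function and argument names, are assumptions; other readings of the text are possible.

```python
from math import comb


def upper_tail(z_star: int, n: int, p: float) -> float:
    """P(X >= z_star) for X ~ Binomial(n, p)."""
    return sum(
        (comb(n, x) * p**x * (1.0 - p) ** (n - x) for x in range(max(z_star, 0), n + 1)),
        0.0,
    )


def critical_value(w_s: int, k: int, kappa: int, alpha: float) -> int:
    """Smallest z* with P(Z >= z*) <= alpha when nothing is copied and
    `kappa` items are known."""
    n = w_s - kappa
    return next(z for z in range(n + 2) if upper_tail(z, n, 1.0 / k) <= alpha)


def power(gamma: int, w_s: int, k: int, kappa: int, alpha: float) -> float:
    """Power when `kappa` is both the true number of known items and the
    value used to set the critical value."""
    z_star = critical_value(w_s, k, kappa, alpha)
    return upper_tail(z_star - gamma, w_s - kappa - gamma, 1.0 / k)


def relative_power_gain(
    gamma: int, w_s: int, k: int, kappa: int, alpha: float = 0.05
) -> float:
    """(power at the true kappa - baseline power at kappa = 0) / baseline."""
    if gamma < 0 or kappa < 0 or gamma + kappa > w_s:
        raise ValueError("need gamma >= 0, kappa >= 0, gamma + kappa <= w_s")
    baseline = power(gamma, w_s, k, 0, alpha)
    if baseline == 0.0:
        raise ZeroDivisionError("baseline power is zero; relative gain undefined")
    return (power(gamma, w_s, k, kappa, alpha) - baseline) / baseline
```

Under this reading the gain can be negative, because the two powers are evaluated under different distributions and at different actual significance levels.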

The proper way to use Figures 1-6 in an application is (1) to identify the size of the 
set of items the source has wrong, $w_s$, (2) to inspect the power function for the number 
of alternatives per item, $k$, in the panel for this value of $w_s$ in Figure 1, (3) to find out 
in Figure 2 how much too large the critical value is for the various possible numbers of 
items the examinee knows, $\kappa_j$, and (4) to use the plot for the actual value of $k$ in Figures 
3-6 to determine how much larger the power would have been if we had known $\kappa_j$. 



Discussion 

The question can be raised whether the test could be improved by having the statistic 
in (4) also include the items on which the source chooses the correct alternative. Just as for 
the $g_2$ and $\omega$ indices, this option would allow the statistical test to derive its power from 
all items in the test rather than only those that $s$ happens to answer incorrectly. 

However, let $R_s$ be the subset of items the source has correct. Analogous to (1), for 
the items in this set a copier can be in three possible true states, with the probabilities of 
a matching correct alternative given by: 

$$
\Pr\{I_{sji} = 1\} =
\begin{cases}
1 & \text{if } j \text{ knows the answer on } i \in R_s, \\
k^{-1} & \text{if } j \text{ guesses blindly on } i \in R_s, \\
1 & \text{if } j \text{ copies from } s \text{ on } i \in R_s.
\end{cases}
\tag{10}
$$



Unfortunately, the probabilities for the events of $j$ knowing and copying the answer are 
equal. In the current framework, it is thus impossible to extend the test with the items in 
$R_s$, because the result would be a test confounding the events of 
$j$ copying the answers $s$ has correct and $j$ knowing them. 

The only possibility to further improve the power of the proposed test seems to be to obtain 
more information about the number of items the examinee actually knows, $\kappa_j$. The 
preceding analysis shows that this information cannot come from the responses of the 
examinee to the items in set $R_s$. However, it can come from another source. For example, 
in a setting where an examinee retakes a test and shows an unusual increase in test scores, 
it may be possible to infer a lower bound to $\kappa_j$ from the first test. Figures 2 and 3 show that, 
particularly for items with few alternatives or sources that have only a small number of 
items incorrect, conducting the test not at $\kappa_j = 0$ but at a lower bound to $\kappa_j$ deliberately 
chosen to be conservative is likely to result in an increase in power that should not be 
disregarded. 

The third assumption on which the proposed statistical test rests is known as the 
"model of knowing or blind guessing" in test theory. This assumption underlies the 
3-parameter logistic model in item response theory (Birnbaum, 1968) and has been used to 
derive the correction for guessing on multiple-choice items widely known as "formula 
scoring". In spite of its popularity, the assumption has been criticized because it ignores 
the fact that examinees may have partial knowledge. For example, they may be able 
to recognize some of the incorrect alternatives as wrong and guess blindly among the 
remaining alternatives or they may have information that helps them to guess the correct 
alternative with a probability larger than $k^{-1}$. 

For an individual examinee responding to an individual item, it may be hard to 
identify what actual process occurs if the examinee does not know the item. We doubt 
whether it will ever be possible to open this black box and formulate a statistical model with 
satisfactory validity. However, for the family of distributions in (6) it is possible to 
evaluate the effect of partial knowledge of some of the items in $W_s$ on the proposed 
statistical test. Figure 2 shows for all parameter values that the critical value of the test 
never decreases but nearly always increases if (1) the number of alternatives per item 
decreases and/or (2) the number of items an examinee knows increases. If the examinee 
is thus able to exclude some of the alternatives as incorrect, the effect is a decrease in the 
effective number of alternatives for some of the items. Likewise, if he/she has knowledge 
that leads to an increase of the probability of success on some of the items, the effect is an 
increase in the expected number of items the examinee has correct relative to an examinee 
who guesses blindly. This effect can be viewed as an increase in the number of items the 
examinee knows. For both types of partial knowledge, the actual critical value for the 
test is higher than required, and again it is the testing agency and not the examinee who 
incurs the loss due to ignoring the possible presence of partial knowledge. 
Figure Captions 

Figure 1. Power functions for $k = 2, \ldots, 5$ and $w_s = 20, 30, 40, 50$ at significance 
level $\alpha = .05$. 

Figure 2. Critical values as a function of $\kappa_j$ for $k = 2, \ldots, 5$ and $w_s = 20, 30, 40, 50$ 
at significance level $\alpha = .05$. 

Figure 3. Relative loss of power due to the number of items known by the examinee, 
$\kappa_j$, for $w_s = 20, 30, 40, 50$ at significance level $\alpha = .05$ (and $k = 2$). 

Figure 4. Relative loss of power due to the number of items known by the examinee, 
$\kappa_j$, for $w_s = 20, 30, 40, 50$ at significance level $\alpha = .05$ (and $k = 3$). 

Figure 5. Relative loss of power due to the number of items known by the examinee, 
$\kappa_j$, for $w_s = 20, 30, 40, 50$ at significance level $\alpha = .05$ (and $k = 4$). 

Figure 6. Relative loss of power due to the number of items known by the examinee, 
$\kappa_j$, for $w_s = 20, 30, 40, 50$ at significance level $\alpha = .05$ (and $k = 5$). 